## [1] 4898 12
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The strongest white wine is 14.2% alcohol and the weakest is 8%. All the white wines are acidic, with pH ranging from 2.72 to 3.82. On a scale from 0 to 10, the median quality is 6 and the mean is 5.9. The highest-rated wine scored 9 and the lowest scored 3.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The wine ratings are whole numbers ranging from 3 to 9. The distribution appears approximately normal, peaking at 6.
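A minimal sketch of how such a tally can be produced, assuming a data frame with an integer `quality` column. The `toy` data frame and its values are invented for illustration and are not the wine data.

```r
# Toy stand-in for the wine data (values invented for illustration).
toy <- data.frame(quality = c(5, 6, 6, 7, 5, 6, 8, 4, 6, 5))

# Count how many wines received each score.
quality_counts <- table(toy$quality)
print(quality_counts)

# Share of wines at each score, which helps when judging the shape
# of the distribution.
quality_share <- prop.table(quality_counts)
print(round(quality_share, 2))
```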
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The alcohol level in the dataset appears to be somewhat positively skewed. Taking log2 of the value makes the distribution closer to normal, although the mode still appears to be lower than the median.
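One way to check the effect of the log2 transform numerically is to compare the skewness before and after. This is only a sketch: the `alcohol` vector below is a hypothetical right-skewed sample, and `skewness()` is a small helper written here (the moment-based form), not a function from the original analysis.

```r
# Hypothetical right-skewed sample standing in for wine$alcohol.
alcohol <- c(8.5, 9.0, 9.2, 9.4, 9.5, 9.8, 10.0, 10.4, 11.0, 12.5, 14.0)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(alcohol)        # positive means a longer right tail
skew_log <- skewness(log2(alcohol))  # the log compresses the right tail
cat("raw:", skew_raw, " log2:", skew_log, "\n")
```

A value closer to zero after the transform suggests a more symmetric distribution, though it says nothing about where the mode sits.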
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH appears to be normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density has a couple of outliers; when they are removed, the distribution looks more normal.
Volatile acidity (mainly acetic acid) in large quantities can lead to an unpleasant vinegar taste in wine. I wonder whether wine quality will correlate with this.
This distribution may be bimodal, with one peak around 1 and the other around 3, perhaps reflecting different kinds of wine, some fruitier and others dry?
Sulphates can contribute to the level of sulphur dioxide in wine, so I would assume sulphate levels are correlated with the gas levels. In high concentrations sulphur dioxide can be detectable in wine; I wonder whether this will affect the quality. I worked out the amount of bound sulphur dioxide by subtracting free from total.
Chlorides measure the amount of salt in the wine. I assume that at larger concentrations it would affect the taste and quality.
There are 4898 observations in the dataset. Quality scores are integers ranging from 3 to 9 and are approximately normally distributed.
Other observations: median alcohol content is 10.40 and median quality is 6. Most wines in the dataset are dry, with a median residual sugar value of 5.2.
The main feature of interest to me is the quality score. I want to see if we can estimate the quality of a wine based on its properties.
I think the features of interest that may affect the quality of wine are volatile.acidity (high levels can make wine taste like vinegar), free.sulfur.dioxide (high levels can be detected by taste/nose) and the ratio of acid to sweet (see below).
I found an article on Wikipedia about wine tasting (http://en.wikipedia.org/wiki/Acids_in_wine#In_wine_tasting). It says that an important factor in the quality of wine is the balance of acidity vs. sweetness, so I also calculated an acid-to-sweet ratio.
wine$acid.sweet.ratio <- (wine$fixed.acidity
+ wine$volatile.acidity) / wine$residual.sugar
summary(log(wine$acid.sweet.ratio))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0160 -0.2992 0.2803 0.4769 1.4040 2.7010
I also added the level of bound sulphur dioxide (total minus free).
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide
I found this article, so I decided to group the wines by the sweetness levels defined by the EU: http://en.wikipedia.org/wiki/Sweetness_of_wine#Residual_sugar
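# Group wines by the EU sweetness classes described in the article.
wine$type <- ''
# More than 45 g/l of residual sugar is Sweet; between 18 and 45 is Medium.
wine$type[wine$residual.sugar > 45] <- 'Sweet'
wine$type[wine$residual.sugar < 45
          & wine$residual.sugar > 18] <- 'Medium'
# Acidity allowance: below 9 g/l counts as Dry when sugar exceeds fixed
# acidity by less than 2, and 9 to 18 g/l counts as Medium Dry when the
# gap is less than 10.
wine$type[wine$residual.sugar < 9
          & (wine$residual.sugar - wine$fixed.acidity < 2)] <- 'Dry'
wine$type[(wine$residual.sugar > 9
           & wine$residual.sugar < 18)
          & (wine$residual.sugar - wine$fixed.acidity < 10)] <- 'Medium Dry'
# Fall back to the plain sugar thresholds for wines not yet classified.
wine$type[(wine$residual.sugar < 4) & wine$type == ''] <- 'Dry'
wine$type[(wine$residual.sugar > 12
           & wine$residual.sugar < 45) & wine$type == ''] <- 'Medium'
wine$type[(wine$residual.sugar > 4
           & wine$residual.sugar < 12) & wine$type == ''] <- 'Medium Dry'
# Store the result as an ordered factor, from driest to sweetest.
wine$type <- ordered(wine$type, levels = c("Dry", "Medium Dry", "Medium", "Sweet"))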
## Dry Medium Dry Medium Sweet
## 3442 1290 165 1
Judging by the sample, it looks like this variety of grape is used mainly to produce dry or medium dry wines. There is only a small proportion of medium and sweet wines in the sample.
There are some obvious correlations that I don't find very interesting, such as citric acid with fixed acidity, free sulphur dioxide with bound sulphur dioxide, and fixed acidity with pH.
Quality seems to be affected to some extent by, in decreasing order of influence, alcohol level (correlation 0.436), density (-0.307), chlorides (-0.21), bound sulphur dioxide (-0.21) and volatile acidity (-0.195). I would like to examine these further.
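A ranking like this can be sketched by computing each variable's Pearson correlation with quality and sorting by absolute value. The `toy` data frame below is invented for illustration (only the column names follow the dataset), so the printed numbers will not match the figures quoted above.

```r
# Toy data frame; values are hypothetical, column names follow the dataset.
toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5),
  density = c(0.998, 0.997, 0.997, 0.995, 0.994, 0.993, 0.992, 0.991),
  quality = c(5, 5, 6, 5, 6, 7, 6, 7)
)

# Correlate every column except quality with quality.
predictors <- setdiff(names(toy), "quality")
cors <- sapply(toy[predictors], function(col) cor(col, toy$quality))

# Rank by strength of association, ignoring the sign.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```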
There appears to be a general increase in wine quality as alcohol level increases, as shown by the positive correlation (0.436). There appears to be more variance at higher and lower concentrations. The strips in the plot arise because the quality scores are integers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
## (7,8] (8,9] (9,10] (10,11] (11,12] (12,13] (13,14] (14,15]
## 2 500 1583 1252 850 609 100 2
## Source: local data frame [8 x 6]
##
## alcohol.bucket lower median upper mean n
## 1 (7,8] 3.5 4 4.5 4.000000 2
## 2 (8,9] 5.0 5 6.0 5.606000 500
## 3 (9,10] 5.0 5 6.0 5.487682 1583
## 4 (10,11] 5.0 6 6.0 5.864217 1252
## 5 (11,12] 6.0 6 7.0 6.190588 850
## 6 (12,13] 6.0 7 7.0 6.571429 609
## 7 (13,14] 6.0 7 7.0 6.720000 100
## 8 (14,15] 7.0 7 7.0 7.000000 2
Here I have split the alcohol levels into buckets, so that we can look at the data in a slightly different way. You can see a clear rise in the median quality score as alcohol level increases.
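The bucketing can be sketched with `cut()` and `tapply()` from base R. This is an assumption about the approach rather than the code used for the table above (which appears to come from a dplyr summary), and the `toy` values are invented.

```r
# Toy data; values are hypothetical.
toy <- data.frame(
  alcohol = c(8.5, 8.8, 9.2, 9.7, 10.1, 10.6, 11.3, 11.8, 12.4, 12.9),
  quality = c(5, 5, 5, 6, 5, 6, 6, 7, 7, 7)
)

# Right-closed buckets one percentage point wide: (8,9], (9,10], ...
toy$alcohol.bucket <- cut(toy$alcohol, breaks = 8:13)

# Summarise quality within each bucket.
bucket_median <- tapply(toy$quality, toy$alcohol.bucket, median)
bucket_mean   <- tapply(toy$quality, toy$alcohol.bucket, mean)
bucket_n      <- tapply(toy$quality, toy$alcohol.bucket, length)

bucket_summary <- data.frame(
  alcohol.bucket = names(bucket_median),
  median = as.vector(bucket_median),
  mean   = as.vector(bucket_mean),
  n      = as.vector(bucket_n)
)
print(bucket_summary)
```

Reporting `n` alongside each summary matters here, because the extreme buckets in the real data hold only two wines each.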
Here is a plot removing the outliers (lower and upper 1%) and plotting a trend line. This shows the slight negative correlation between density and quality (-0.307).
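Trimming the lower and upper 1% can be sketched with `quantile()`. The `dens` vector is a hypothetical stand-in for `wine$density` with one extreme value added, and the variable names are my own.

```r
# Hypothetical density values: 99 evenly spaced points plus one outlier.
dens <- c(seq(0.990, 0.999, length.out = 99), 1.039)

# 1st and 99th percentiles define the range to keep.
limits <- quantile(dens, probs = c(0.01, 0.99))

# Keep only the values inside those limits.
trimmed <- dens[dens >= limits[1] & dens <= limits[2]]

cat("kept", length(trimmed), "of", length(dens), "values\n")
```

The same logical filter can be applied to the rows of a data frame before plotting, so that the trend line is not pulled around by a handful of extreme points.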
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Most wines have chloride levels between 0.01 and 0.045. There are several outliers with varying wine qualities. In my later plots I removed the bottom 1% and top 4% of points. This shows that there is a small negative correlation between chlorides and quality (-0.21).
A high concentration of bound sulphur dioxide appears to affect the quality of the wine. It displays a slight negative correlation (-0.21); however, there are few data points at larger concentrations. The latter plot has the upper and lower 1% of points removed.
A high volatile.acidity concentration appears to affect the quality of the wine. There is a slight negative correlation (-0.195); however, there are few data points at larger concentrations. The latter plot has the upper and lower 1% of points removed.
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
There is a strong negative correlation between alcohol and density (-0.78).
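For reference, a small sketch of how `cor.test()` output can be read programmatically. The two vectors are invented, so the numbers differ from the output above; the point is the relationship between the estimate, the degrees of freedom and the t statistic.

```r
# Hypothetical paired measurements standing in for the wine columns.
alcohol <- c(9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5)
dens    <- c(0.998, 0.997, 0.997, 0.995, 0.994, 0.993, 0.992, 0.991)

result <- cor.test(dens, alcohol, method = "pearson")

estimate <- unname(result$estimate)   # Pearson correlation r
df       <- unname(result$parameter)  # degrees of freedom, n - 2
ci       <- result$conf.int           # 95 percent confidence interval

# The reported t statistic follows from r and df:
# t = r * sqrt(df / (1 - r^2))
t_manual <- estimate * sqrt(df / (1 - estimate^2))

cat("r =", estimate, " t =", t_manual, " p =", result$p.value, "\n")
```

Plugging the values from the output above into the same formula (r = -0.7801376, df = 4896) gives t of about -87.25, which agrees with the reported statistic.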
The strongest correlation with quality was alcohol content: as alcohol content goes up, quality tends to go up (correlation 0.436). This is clearer in the box plot I produced, which groups the data by alcohol content in 1% increments. Apart from a dip between the (8,9] and (9,10] buckets, the mean quality of each group increases with alcohol level.
Density is also negatively correlated with the quality of wine, which makes sense because alcohol and density are strongly negatively correlated and alcohol and quality are positively correlated.
The other variables I looked at showed some correlation. High concentrations of bound sulphur dioxide, chlorides and volatile acidity tend to adversely affect wine quality, each showing a small negative correlation with quality.
I would have expected the acid-to-sweetness ratio to correlate more with quality, but it showed a very small negative correlation (-0.015). I would still like to explore this further in the multivariate section, since research has indicated that this balance is important in wine quality.
Residual sugar appears to be negatively correlated with alcohol level. I assume this is because sweeter wines tend to be less strong, perhaps because less of their sugar has been fermented into alcohol.
The strongest correlation was between residual sugar and density (0.839). I assume this is because dissolved sugar makes the wine denser than water alone, so density increases as the concentration increases.
Here is another plot matrix. This time I have used the type of wine to differentiate the points, to see whether it affects any of the relationships.
Let's convert quality to a factor, since it takes a fixed set of integer values.
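A minimal sketch of the conversion, assuming the observed score range of 3 to 9. The `toy` data frame and the new column name `quality.factor` are hypothetical; the original may simply have overwritten `quality`.

```r
# Toy data; values are hypothetical.
toy <- data.frame(quality = c(5, 6, 6, 7, 5, 3, 9))

# Listing the levels explicitly keeps scores that happen to be absent
# from a subset (here 4 and 8) as empty levels rather than dropping them.
toy$quality.factor <- factor(toy$quality, levels = 3:9)

print(levels(toy$quality.factor))
print(table(toy$quality.factor))
```

Keeping the numeric column alongside the factor means correlations can still be computed on the scores, while plots can use the factor for grouping and colour.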
Now that I've converted quality to a factor, it is probably worth looking at the variable matrix again.